The AFRL-MITLL WMT15 System: There’s More than One Way to Decode It!

Jeremy Gwinnup, Timothy Anderson, Grant Erdmann, Katherine Young, Christina May
Air Force Research Laboratory
{jeremy.gwinnup.ctr, timothy.anderson.20, grant.erdmann, katherine.young.1.ctr, christina.may.ctr}@us.af.mil

Michaeel Kazi, Elizabeth Salesky, Brian Thompson
MIT Lincoln Laboratory
{michaeel.kazi, elizabeth.salesky, brian.thompson}@ll.mit.edu


Abstract 

This paper describes the AFRL-MITLL statistical MT system and the improvements that were developed during the WMT15 evaluation campaign. As part of these efforts, we experimented with a number of extensions to the standard phrase-based model that improve performance on the Russian-to-English translation task, creating three submission systems with different decoding strategies. Out-of-vocabulary words were addressed with named entity postprocessing.

1  Introduction 

As part of the 2015 Workshop on Machine Translation (WMT15) shared translation task, the MITLL and AFRL human language technology teams participated in the Russian-English translation task. Our machine translation systems build on our systems from IWSLT2014 (Kazi et al., 2014) and WMT14 (Schwartz et al., 2014), adding hierarchical decoding systems (Hoang and Koehn, 2008), neural network joint models (Devlin et al., 2014) and the use of Drem (Erdmann and Gwinnup, 2015) during the system tuning process.

2  System  Description 

We submitted systems for the Russian-to-English machine translation shared task. In all submitted systems, we used either phrase-based or hierarchical variants of the moses decoder (Koehn et al., 2007). As in previous years, our submitted systems were trained using only the constrained data supplied for the task.

2.1 Data Usage

In training our Russian-English systems, we used the following corpora to train translation and language models: Yandex, Commoncrawl (Smith et al., 2013), LDC Gigaword English v5 (Parker et al., 2011) and News Commentary. The Wikinames corpus was reserved to train named entity recognizers.

2.2  Data  Preprocessing 

As with our WMT14 submission systems, preprocessing to address issues with the training data was required to ensure optimal system performance. Unicode characters in the private use, control character (C0, C1, zero-width, non-breaking, joiner, directionality and paragraph markers), and unallocated ranges were removed. Punctuation normalization and tokenization using Moses preprocessing scripts were then applied before lowercasing the data. The Commoncrawl corpus was further processed to exclude wrong-language text and to normalize mixed-alphabet spellings.
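The character-removal step can be illustrated with a minimal sketch. It assumes that the ranges listed above correspond to the Unicode general categories Co, Cc, Cf, Cn, Zl and Zp; that mapping, the function name `strip_unwanted`, and the decision to keep newlines and tabs are assumptions made for illustration and are not taken from our preprocessing scripts.

```python
import unicodedata

# Unicode general categories treated as removable in this sketch (an assumption):
# Co = private use, Cc = control (C0/C1), Cf = format (zero-width, joiners,
# directionality marks), Cn = unassigned, Zl/Zp = line/paragraph separators.
DROP_CATEGORIES = {"Co", "Cc", "Cf", "Cn", "Zl", "Zp"}
KEEP = {"\n", "\t"}  # ordinary whitespace controls that should survive


def strip_unwanted(text: str) -> str:
    """Return text with characters from the dropped categories removed."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in DROP_CATEGORIES
    )


# A zero-width space, a private-use character, a bell and a noncharacter
# are removed; the Cyrillic and Latin letters are untouched.
cleaned = strip_unwanted("Мос\u200bква\ue000 is\x07 here\uffff\n")
```

Filtering by general category rather than by explicit code point ranges keeps the sketch short, at the cost of being somewhat broader than a hand-written range list.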

2.3  Factored  Data  Generation 

We generated a class-factored version of the parallel Russian-English training data by using mkcls to produce 600 word classes for each side of the data. The factored data was then used to create a factored translation model and an in-domain class language model (Brown et al., 1993) for the English portion.
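As an illustration of what class-factored data looks like, the sketch below attaches a class identifier to each token using the `|` factor separator expected by Moses factored models. The function name `to_factored`, the fallback class for unseen words and the class identifiers are hypothetical; the real word-to-class mapping would be read from the mkcls output.

```python
def to_factored(tokens, word_classes, unknown_class="0"):
    """Join each surface token with its class id using the '|' factor separator."""
    return " ".join(
        f"{tok}|{word_classes.get(tok, unknown_class)}" for tok in tokens
    )


# Hypothetical class ids; in practice they would come from the mkcls output.
classes = {"the": "17", "cat": "402", "sat": "233"}
line = to_factored("the cat sat".split(), classes)
# line is "the|17 cat|402 sat|233"
```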

2.4  Phrase  and  Rule  Table  Training 

Phrase tables and rule tables were trained on the preprocessed data using scripts provided with the moses distribution. Both rule tables and phrase tables used Good-Turing discounting (Gale, 1995). Hierarchical lexicalized reordering models (Galley and Manning, 2008) were also trained for use in the phrase-based systems.
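For readers unfamiliar with Good-Turing discounting, the textbook form replaces a raw count c with c* = (c + 1) N_{c+1} / N_c, where N_c is the number of items observed exactly c times. The sketch below applies that formula to toy counts. It is not the moses implementation; the cutoff `max_count` and the fallback for missing count-of-counts are assumptions chosen to keep the example small.

```python
from collections import Counter


def good_turing_adjusted(counts, max_count=5):
    """Map each raw count c to c* = (c + 1) * N_{c+1} / N_c for small c.

    Counts above max_count, or counts with N_{c+1} == 0, are left unchanged.
    """
    count_of_counts = Counter(counts.values())
    adjusted = {}
    for c in sorted(set(counts.values())):
        n_c = count_of_counts[c]
        n_next = count_of_counts[c + 1]
        if c <= max_count and n_next > 0:
            adjusted[c] = (c + 1) * n_next / n_c
        else:
            adjusted[c] = float(c)
    return adjusted


# Toy phrase-pair counts (made up): five singletons, two pairs seen twice,
# and one pair seen three times.
pair_counts = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1, "f": 2, "g": 2, "h": 3}
adjusted = good_turing_adjusted(pair_counts)
# Singletons are discounted from 1 to 0.8 and doubletons from 2 to 1.5.
```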

An additional phrase table was trained on the lemmatized forms of the Russian training data. These lemmatized forms were generated by the mystem tool.

2.5  Language  Model  Training 

The English data sources listed in Section 2.1 were used to train a very large 6-gram language model (BigLM15). The English portion of the parallel data was processed into class form as outlined in Section 2.3 to generate an in-domain 600-class language model. kenlm (Heafield, 2011) was used to train these 6-gram models. These models were then binarized and stored on local solid-state disks on each machine in our cluster to improve load time and reduce fileserver traffic.

2.6  Operation  Sequence  Models 

Using both the Russian and English data generated in Section 2.3, we trained order-5 Operation Sequence models (Durrani et al., 2011) for both the surface and class-factored forms of the data. These models were then used in our factored phrase-based system.

2.7  Neural  Network  Joint  Models 

NNJM (Devlin et al., 2014) models were used to rescore n-best lists. We trained these models on the alignments produced by mgiza (Gao and Vogel, 2008) over the parallel text. As in (Devlin et al., 2014), we trained four different models. The standard model is “source-to-target, left-to-right,” which evaluates $p(t_i \mid T, S)$ with target window $T = (t_{i-1}, t_{i-2}, \ldots, t_{i-n})$ and source window $S = (s_{a_i-m}, \ldots, s_{a_i}, \ldots, s_{a_i+m})$, where $s_{a_i}$ is word-aligned to $t_i$. The four permutations of this are defined by (a) whether to count upwards from $i$ instead of downwards (this is left-to-right vs right-to-left), and (b) whether to swap the sources and targets entirely (source-to-target vs target-to-source).
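To make the conditioning context of the standard source-to-target, left-to-right model concrete, the sketch below extracts the target history (the n words preceding position i, nearest first) and a source window centred on the source word aligned to target word i. The function name, the small window sizes, the padding tokens and the toy alignment are assumptions for illustration only.

```python
def nnjm_context(source, target, align, i, n=3, m=2, bos="<s>", eos="</s>"):
    """Return (T, S) for target position i.

    align[i] is the index of the source word aligned to target word i.
    """
    # Target history: t_{i-1}, t_{i-2}, ..., t_{i-n}, padded past the start.
    history = [target[j] if j >= 0 else bos for j in range(i - 1, i - n - 1, -1)]
    # Source window: m words on each side of the aligned source word.
    a_i = align[i]
    window = []
    for j in range(a_i - m, a_i + m + 1):
        if j < 0:
            window.append(bos)
        elif j >= len(source):
            window.append(eos)
        else:
            window.append(source[j])
    return history, window


source = "я вижу кошку".split()
target = "i see a cat".split()
align = [0, 1, 2, 2]  # hypothetical alignment of each target word
T, S = nnjm_context(source, target, align, i=3)
```

The right-to-left variant would collect the words following position i instead, and the target-to-source variant would swap the roles of `source` and `target`.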

We experimented with NNJM decoding (via a simple feature function in Moses). We achieved some benefit (+0.48) with this approach, but rescoring with a single source-to-target NNJM on 200-best lists produced better results in this case (+0.90). This was on a single system tuned on newstest2013 and tested on newstest2014 (baseline 29.07). In testing, 2-hidden-layer rescoring models outperformed the 1-hidden-layer decoding model.
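N-best rescoring of this kind can be sketched as follows: each hypothesis keeps its decoder score, an additional model score is added with a weight, and the highest-scoring hypothesis is selected. The function name, the single weight and all scores below are made up for illustration; in our systems the weights are set by the tuning process rather than by hand.

```python
def rescore_nbest(nbest, nnjm_score, weight):
    """Pick the hypothesis maximizing decoder score + weight * NNJM score.

    nbest is a list of (hypothesis, decoder_score) pairs; nnjm_score is any
    callable returning a log-probability for a hypothesis.
    """
    return max(nbest, key=lambda item: item[1] + weight * nnjm_score(item[0]))[0]


# Made-up scores for illustration only.
nbest = [("the cat sat", -4.0), ("cat the sat", -3.8), ("the cat sits", -4.5)]
toy_nnjm = {"the cat sat": -1.0, "cat the sat": -3.0, "the cat sits": -1.2}
best = rescore_nbest(nbest, toy_nnjm.get, weight=0.5)
# The decoder alone prefers "cat the sat"; after rescoring "the cat sat" wins.
```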

Additionally, we experimented with vocabulary sizes determined by minimum count, trying 20 and 25. Using 20, our vocabulary was approximately 80,000 Russian words and 40,000 English; with 25, it was 70,000 and 34,000, respectively. We compared rescoring with a single, standard model (s2t, l2r) to rescoring with all directions, with results listed in Table 1. We found that our max scores were better when rescoring with all four models.
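Selecting a vocabulary by minimum count can be sketched as below. The helper names and the single `<unk>` token for words under the threshold are assumptions for illustration; the treatment of rare words in the actual NNJM training may differ.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep words seen at least min_count times in the training sentences."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, c in counts.items() if c >= min_count}


def map_unknown(sentence, vocab, unk="<unk>"):
    """Replace words outside the vocabulary with a single unknown token."""
    return [w if w in vocab else unk for w in sentence.split()]


corpus = ["a b a", "a c b", "d a"]
vocab = build_vocab(corpus, min_count=2)
mapped = map_unknown("a b c d", vocab)
```

Raising the threshold from 20 to 25 shrinks the vocabulary, as the sizes reported above show.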


             Baseline   1 NNJM           4 NNJMs
min count               20      25       20      25
max          27.71      27.90   28.05    27.90   28.07
mean         27.48      27.61   27.81    27.67   27.60

Table 1: NNJM rescoring on newstest2015, optimizing on newstest2014.


2.8  Processing  of  Unknown  Words 

In our submission systems, we allowed words unknown to the decoder to be passed through to the translated output. The output of the phrase-based and hierarchical submission systems was then processed with permissive named entity lookup and selective transliteration. Our factored phrase-based system’s output had named entity tagging and then permissive lookup and selective transliteration applied. Both techniques are described below. Score improvements in uncased BLEU are reported in Table 2. We see that application of permissive lookup and selective transliteration yielded an improvement of +0.48 BLEU over a baseline system, while the application of named entity processing, permissive lookup and selective transliteration yielded a +0.57 BLEU gain.

2.8.1 Processing Unknown Words as Named Entities

The named entity post-process uses Russian-English pairs in the combined Wikinames and Wikititles lists (the wiki pairs list) and a transliteration-mined list to replace unknown words with English equivalents. We began by stemming each list to remove Russian noun and adjective endings. To the wiki pairs list, we added additional pairs yielded by replacing word-internal punctuation marks in existing wiki pairs with spaces. We used giza++ (Och and Ney, 2003) to align Russian-English phrases from the wiki list. We then used these alignments to start a generated list of pairs with only one Russian word and one English word in a pair. Of the aligned pairs, we only included pairs that were aligned with one another three or more times. Only one-to-one alignments counted toward the three-alignment rule. We also removed entries where the English word in the pair occurred in a list of stopwords, as well as entries where the English word consisted only of digits. To the generated list, we also added pairs taken directly from the wiki list that consisted of a single Russian word and a single English word. Finally, we also added the highest quality pairs from the transliteration-mined list described below.

Upon encountering a single word without word-internal punctuation, the system first searches through the generated list, and returns a list of found guesses. If no items are found in the generated list, the wiki list is then searched. If still no guesses are found, then the transliteration-mined list is searched. The same process occurs for a word containing word-internal punctuation, but after a failed iteration of the search process, the punctuation is replaced with a space and the wiki lists are searched. Finally, if that iteration fails, then the search process occurs on each individual word and a concatenation of English definitions is added to the guess list for every possible combination of guesses for each component word. An English language model is used to choose among the guesses.
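The cascaded search for a single word without internal punctuation can be sketched as follows. The lists are represented as dictionaries from stemmed Russian forms to English guesses; the function names, the toy entries and the toy language model scores are all hypothetical, and the handling of word-internal punctuation is omitted.

```python
def lookup_guesses(word, generated, wiki, translit):
    """Search the lists in priority order; return the first non-empty guess list."""
    for table in (generated, wiki, translit):
        guesses = table.get(word, [])
        if guesses:
            return guesses
    return []


def choose_guess(guesses, lm_score):
    """Pick the guess the language model scores highest, or None if there are none."""
    return max(guesses, key=lm_score) if guesses else None


# Toy lists keyed by stemmed Russian forms (entries made up for illustration).
generated = {"москв": ["moscow"]}
wiki = {"москв": ["moskva"], "путин": ["putin", "poutine"]}
translit = {"обам": ["obama"]}
toy_lm = {"putin": -2.0, "poutine": -6.0}.get

first = lookup_guesses("москв", generated, wiki, translit)
picked = choose_guess(lookup_guesses("путин", generated, wiki, translit), toy_lm)
```

Because the search stops at the first list that yields a guess, an entry in the generated list shadows any entry for the same word in the later lists.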

2.8.2 Permissive Lookup and Selective Transliteration of Unknown Words

The second step focuses on selective transliteration of named entities (NE) among the out-of-vocabulary (OOV) words. We hypothesize that retaining transliterated forms of NE will improve readability, even if the output is not a direct match to the English reference.

The named entity pairs were harvested from the Common Crawl using NE tagging and rule-based transliteration matching. We used mystem to tag Russian NE words, and then compared them to capitalized English words in the parallel sentence, using an edit distance based on the typical sound values for the Cyrillic and Latin letters. We also searched for transliteration matches for capitalized words that were not tagged by mystem, excluding sentence-initial words. Transliteration matches with a zero edit distance were added to the list of NE pairs used in the initial named entity processing, above. In this second step, we expand the list to include pairs with greater edit distance when they are validated by repeat occurrence.
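One simple way to realize an edit distance over sound values is to map each Cyrillic letter to a typical Latin value and then compute an ordinary Levenshtein distance against the English candidate. The sketch below does exactly that and should be read as a stand-in: the letter table is deliberately incomplete, and the actual rule-based matcher is not reproduced here.

```python
# Illustrative Cyrillic-to-Latin sound values (an assumption, far from complete).
SOUND = {"а": "a", "б": "b", "в": "v", "д": "d", "е": "e", "и": "i", "к": "k",
         "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
         "т": "t", "у": "u"}


def romanize(word):
    """Replace each Cyrillic letter with its assumed typical Latin sound value."""
    return "".join(SOUND.get(ch, ch) for ch in word.lower())


def edit_distance(a, b):
    """Plain Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def translit_distance(russian, english):
    """Edit distance between the romanized Russian word and the English word."""
    return edit_distance(romanize(russian), english.lower())


exact = translit_distance("Путин", "Putin")     # romanizes to "putin"
looser = translit_distance("Москва", "Moscow")  # "moskva" vs "moscow"
```

Under this sketch the first pair is a zero-distance match, while the second would only be accepted in the expanded list if it recurred often enough.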

If named entity processing as outlined in Section 2.8.1 was not applied, the named entity pairs list was expanded by adding the Wikinames and Wikititles lists.

After permissive named entity lookup, we attempt to distinguish NE from common words on the basis of capitalization in the Russian source file. Capitalized words that do not begin a sentence are assumed to be NE, and are transliterated. Lowercased words, and capitalized words that begin a sentence, are assumed to be common words and are dropped from the output.
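The capitalization rule can be written down in a few lines. In this sketch the positions of the remaining unknown words are assumed to be known already, and the transliterator is a tiny lookup used as a placeholder; both are hypothetical simplifications.

```python
def handle_unknowns(source_tokens, unknown_positions, transliterate):
    """Transliterate unknown words judged to be named entities; drop the rest.

    Returns a dict mapping each unknown position to its replacement string,
    or to None when the word is dropped from the output.
    """
    decisions = {}
    for pos in unknown_positions:
        token = source_tokens[pos]
        # Capitalized and not sentence-initial: assumed to be a named entity.
        is_named_entity = pos != 0 and token[:1].isupper()
        decisions[pos] = transliterate(token) if is_named_entity else None
    return decisions


# Placeholder transliterator; a real one would cover the whole alphabet.
toy_table = {"Гришин": "Grishin"}
source = ["Вчера", "Гришин", "посетил", "город"]
decisions = handle_unknowns(source, [0, 1, 2], lambda w: toy_table.get(w, w))
# Position 0 is sentence-initial and position 2 is lowercased, so both are
# dropped; position 1 is transliterated.
```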

3  Results 

We submitted three systems for evaluation, each employing a different decoding strategy: traditional phrase-based, hierarchical, and factored phrase-based. Each system is described below. Automatically scored results reported in BLEU (Papineni et al., 2002) for our submission systems can be found in Table 3.

Finally, as part of WMT15, the results of our submission systems listed in Table 3 were ranked by monolingual human judges against the machine translation output of other WMT15 participants. These judgements are reported in WMT (2015).

3.1 Phrase-Based

We used a standard phrase-based approach, using lowercased data. The lemma-based phrase table described in Section 2.4 was used as a backoff phrase table. We trained a hierarchical lexicalized reordering model, and used two separate class-based (factored) language models: one using 600 classes on the in-domain target-side parallel data, and the other using the LDC Gigaword English v5 NYT corpus. N-best lists from moses were rescored with 4-way NNJMs, and the system weights were tuned with PRO (Hopkins and May, 2011). Selective transliteration as described in Section 2.8.2 was then applied to the decoder output.

System        Process Applied                      baseline BLEU   postproc BLEU   Δ BLEU
phrase-based  PermLookup + SelTranslit             27.72           28.20           +0.48
hiero         PermLookup + SelTranslit             27.43           27.91           +0.48
pb-factored   NEProc + PermLookup + SelTranslit    27.18           27.75           +0.57

Table 2: NEProc + SelTranslit post-processing improvement measured in uncased BLEU

3.2 Hierarchical

New for this year, we trained a hierarchical system using the same parallel data as our phrase-based systems. The rule table was created as outlined in Section 2.4 and then filtered to contain only rules relating to the Russian content of the newstest test sets for years 2012-2015. This filtering was performed to reduce the size of the rule table, both to lower system memory requirements and for expediency. The incremental-search algorithm (Heafield et al., 2013) and BigLM15 were used to decode the dev (newstest2014) and test (newstest2015) data. Drem was employed to optimize this system with Expected BLEU and Expected Meteor (Denkowski and Lavie, 2014) metrics. Finally, selective transliteration was employed as described in Section 2.8.2.
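The idea behind filtering a rule table to a known test set can be sketched as below. This is a deliberate simplification and not the filtering script we ran: it assumes rules are text lines whose fields are separated by ` ||| `, treats any source-side token starting with `[` as a non-terminal, and keeps a rule when all of its source-side terminals occur somewhere in the test data, without checking that they occur together.

```python
def filter_rules(rule_lines, test_sentences):
    """Keep rules whose source-side terminals all appear in the test sentences."""
    test_vocab = {tok for sent in test_sentences for tok in sent.split()}
    kept = []
    for line in rule_lines:
        source_side = line.split(" ||| ")[0]
        # Tokens such as [X] or [X][X] are non-terminals, not words.
        terminals = [t for t in source_side.split() if not t.startswith("[")]
        if all(t in test_vocab for t in terminals):
            kept.append(line)
    return kept


# Toy rules with made-up scores.
rules = [
    "[X][X] кошку [X] ||| [X][X] cat [X] ||| 0.5",
    "собаку [X] ||| dog [X] ||| 0.4",
]
kept = filter_rules(rules, ["я вижу кошку"])
# Only the first rule survives, since "собаку" never occurs in the test data.
```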

3.3  Factored  Phrase-Based 

For our last system, we used a factored phrase-based approach (Koehn and Hoang, 2007) where the surface form of the training data was augmented with word classes. These classes were generated on the parallel training data outlined in Section 2.4 using mkcls to group the words into 600 classes for both the English and Russian portions of the parallel training corpus. A phrase table and hierarchical reordering model were then trained using the moses training process on both the surface form and the class factor. Order-5 operation sequence models were separately trained on the surface forms and the class factors. An order-6 class-factor LM (Shen et al., 2006) was also trained on the English portion of the parallel training data to supplement the use of BigLM15. NNJMs as outlined in Section 2.7 were used to rescore the n-best lists from the decode. Following this rescoring, Drem was employed to optimize feature weights using the Expected Corpus BLEU metric (Smith and Eisner, 2006). After optimization and decoding of the test set, remaining unknown words were processed as described in Sections 2.8.1 and 2.8.2.

4  Discussion 

Our three submitted systems all scored similarly against the official test set. Manual examination of our systems’ output shows that there are significant differences in sentence structure and content.


System        Cased BLEU   Uncased BLEU
phrase-based  27.0         28.2
hiero         26.7         27.9
pb-factored   26.4         27.8

Table 3: MT submission systems decoding newstest2015


We scored one system output against another (as reference) with mteval13a.pl in both directions, as BLEU scores are not symmetric. Results are listed in Table 4. Interestingly, the factored phrase-based and hierarchical systems were more similar to each other than to the traditional phrase-based system. This suggests that the addition of class factors serves a similar function to the use of hierarchical decoding.
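The asymmetry is easy to see with a small, unsmoothed corpus-level BLEU. The sketch below is not the official scorer (it uses whitespace tokenization and a single reference per segment), and the two toy outputs are invented; it only shows that swapping the hypothesis and reference roles changes both the n-gram precisions and the brevity penalty.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU with one reference per segment."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_brevity + log_precision)


system_a = ["the cat sat on the mat today"]
system_b = ["the cat sat on the mat"]
a_vs_b = corpus_bleu(system_a, system_b)  # A as test, B as reference
b_vs_a = corpus_bleu(system_b, system_a)  # B as test, A as reference
```

Here the longer output loses n-gram precision when scored against the shorter one, while the shorter output keeps perfect precision but pays a brevity penalty, so the two directions differ.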


Test    Ref     BLEU
PB      Hiero   57.18
PBFac   Hiero   76.34
Hiero   PB      57.09
PBFac   PB      60.54
PB      PBFac   60.47
Hiero   PBFac   70.18

Table 4: Submission system similarity measured in uncased BLEU

It  will  be  interesting  to  see  the  results  of  human 
evaluation  on  three  markedly  different  systems. 

5  Conclusion 

In this paper, we present data preparation and processing techniques for our Russian-English submissions to the 2015 Workshop on Machine Translation (WMT15) shared translation task. Our submissions examine three different decoding strategies and the effectiveness of sophisticated handling of unknown words. While scoring similarly, each system produced markedly different output.